Step 1: Load the Datasets
import pandas as pd
# Load the datasets
transaction_data = pd.read_csv('transaction_data.csv')
purchase_behaviour = pd.read_csv('purchase_behaviour.csv')
# Inspect the first few rows of each dataset
print("Transaction Data:")
print(transaction_data.head(10))
print("\nPurchase Behaviour Data:")
print(purchase_behaviour.head(10))
Transaction Data:
DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR \
0 43390 1 1000 1 5
1 43599 1 1307 348 66
2 43605 1 1343 383 61
3 43329 2 2373 974 69
4 43330 2 2426 1038 108
5 43604 4 4074 2982 57
6 43601 4 4149 3333 16
7 43601 4 4196 3539 24
8 43332 5 5026 4525 42
9 43330 7 7150 6900 52
PROD_NAME PROD_QTY TOT_SALES
0 Natural Chip Compny SeaSalt175g 2 6.0
1 CCs Nacho Cheese 175g 3 6.3
2 Smiths Crinkle Cut Chips Chicken 170g 2 2.9
3 Smiths Chip Thinly S/Cream&Onion 175g 5 15.0
4 Kettle Tortilla ChpsHny&Jlpno Chili 150g 3 13.8
5 Old El Paso Salsa Dip Tomato Mild 300g 1 5.1
6 Smiths Crinkle Chips Salt & Vinegar 330g 1 5.7
7 Grain Waves Sweet Chilli 210g 1 3.6
8 Doritos Corn Chip Mexican Jalapeno 150g 1 3.9
9 Grain Waves Sour Cream&Chives 210G 2 7.2
Purchase Behaviour Data:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER
0 1000 YOUNG SINGLES/COUPLES Premium
1 1002 YOUNG SINGLES/COUPLES Mainstream
2 1003 YOUNG FAMILIES Budget
3 1004 OLDER SINGLES/COUPLES Mainstream
4 1005 MIDAGE SINGLES/COUPLES Mainstream
5 1007 YOUNG SINGLES/COUPLES Budget
6 1009 NEW FAMILIES Premium
7 1010 YOUNG SINGLES/COUPLES Mainstream
8 1011 OLDER SINGLES/COUPLES Mainstream
9 1012 OLDER FAMILIES Mainstream
Step 2: Understanding the Data Structure 🧐 To get a clear picture of the datasets, let's check:
✅ Column Names – What are the features available in the data?
✅ Data Types – Are they integers, floats, or strings?
✅ Shape of Data – How many rows and columns are there?
# Check the structure of the datasets
print("\nTransaction Data Info:")
print(transaction_data.info())
print("\nPurchase Behaviour Data Info:")
print(purchase_behaviour.info())
Transaction Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   DATE            264836 non-null  int64
 1   STORE_NBR       264836 non-null  int64
 2   LYLTY_CARD_NBR  264836 non-null  int64
 3   TXN_ID          264836 non-null  int64
 4   PROD_NBR        264836 non-null  int64
 5   PROD_NAME       264836 non-null  object
 6   PROD_QTY        264836 non-null  int64
 7   TOT_SALES       264836 non-null  float64
dtypes: float64(1), int64(6), object(1)
memory usage: 16.2+ MB
None

Purchase Behaviour Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   LYLTY_CARD_NBR    72637 non-null  int64
 1   LIFESTAGE         72637 non-null  object
 2   PREMIUM_CUSTOMER  72637 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB
None
Step 3: Handling Missing Values 🔍 Before diving deeper, let's check if any data is missing. Missing values can impact the accuracy of analysis, so we need to identify them first.
Here's the approach:
✅ Check for Missing Values – Count how many NaN (null) values exist in each column.
✅ Understand the Impact – If there are missing values, decide whether to fill or drop them.
# Check for missing values
print("\nMissing Values in Transaction Data:")
print(transaction_data.isnull().sum())
print("\nMissing Values in Purchase Behaviour Data:")
print(purchase_behaviour.isnull().sum())
Missing Values in Transaction Data:
DATE              0
STORE_NBR         0
LYLTY_CARD_NBR    0
TXN_ID            0
PROD_NBR          0
PROD_NAME         0
PROD_QTY          0
TOT_SALES         0
dtype: int64

Missing Values in Purchase Behaviour Data:
LYLTY_CARD_NBR      0
LIFESTAGE           0
PREMIUM_CUSTOMER    0
dtype: int64
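Both tables come back clean here, but if gaps did appear, the fill-or-drop decision from the checklist above might look like this (a minimal sketch on a toy frame, not the project data):

```python
import pandas as pd
import numpy as np

# Toy frame with gaps, standing in for a transaction table
df = pd.DataFrame({
    'PROD_QTY': [2, np.nan, 1],
    'TOT_SALES': [6.0, 6.3, np.nan],
})

# Option 1: drop rows with any missing value (reasonable when gaps are rare)
dropped = df.dropna()

# Option 2: fill numeric gaps with a neutral value such as the column median
filled = df.fillna(df.median(numeric_only=True))

print(len(dropped), filled.isnull().sum().sum())  # 1 0
```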
Step 4: Identifying Duplicates 🔄 Duplicate rows can distort analysis, so let's check if any exist and decide how to handle them.
Here's the approach:
✅ Find Duplicates – Count how many duplicate rows are present.
✅ Decide on Action – Remove duplicates if necessary to keep the data clean.
# Check for duplicates
print("\nDuplicate Rows in Transaction Data:", transaction_data.duplicated().sum())
print("Duplicate Rows in Purchase Behaviour Data:", purchase_behaviour.duplicated().sum())
Duplicate Rows in Transaction Data: 1
Duplicate Rows in Purchase Behaviour Data: 0
# Remove duplicates
transaction_data = transaction_data.drop_duplicates()
purchase_behaviour = purchase_behaviour.drop_duplicates()
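A full-row check like the one above can miss partial duplicates, so it can also be worth checking the transaction key on its own (a sketch on a toy table; a repeated TXN_ID is not necessarily an error, since one basket could span several product rows):

```python
import pandas as pd

# Toy stand-in for the transaction table
txns = pd.DataFrame({
    'TXN_ID': [1, 2, 2, 3],
    'PROD_NBR': [5, 66, 61, 69],
})

# keep=False flags every row in a duplicated group, not just the repeats
shared = txns.duplicated(subset=['TXN_ID'], keep=False)
print("Rows sharing a TXN_ID:", shared.sum())  # 2
```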
Step 5: Standardizing Column Names ✨ To maintain consistency and avoid errors, let's clean up the column names by:
✅ Removing Extra Spaces – strip any leading or trailing whitespace.
✅ Checking Consistency – both files already use uppercase snake_case names (e.g., TOT_SALES), so no renaming is needed.
✅ Ensuring Uniformity – confirm there are no special characters or other inconsistencies.
# Standardize column names
transaction_data.columns = transaction_data.columns.str.strip()
purchase_behaviour.columns = purchase_behaviour.columns.str.strip()
# Verify column names
print("\nStandardized Column Names in Transaction Data:")
print(transaction_data.columns)
print("\nStandardized Column Names in Purchase Behaviour Data:")
print(purchase_behaviour.columns)
Standardized Column Names in Transaction Data:
Index(['DATE', 'STORE_NBR', 'LYLTY_CARD_NBR', 'TXN_ID', 'PROD_NBR',
'PROD_NAME', 'PROD_QTY', 'TOT_SALES'],
dtype='object')
Standardized Column Names in Purchase Behaviour Data:
Index(['LYLTY_CARD_NBR', 'LIFESTAGE', 'PREMIUM_CUSTOMER'], dtype='object')
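The strip() above only removes stray whitespace; since these files already use consistent uppercase names, they are kept as-is. Had the headers been messier, a fuller clean-up might look like this (a sketch on hypothetical column names):

```python
import pandas as pd

# Hypothetical messy headers
df = pd.DataFrame(columns=[' Store Nbr ', 'Total Sales ($)'])

# Strip whitespace, lowercase, and collapse non-word characters to underscores
df.columns = (df.columns.str.strip()
                        .str.lower()
                        .str.replace(r'[^\w]+', '_', regex=True)
                        .str.strip('_'))
print(list(df.columns))  # ['store_nbr', 'total_sales']
```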
Step 6: Converting Data Types 🔄 To make data analysis smoother, let's:
✅ Convert Date Columns – Change the DATE column to datetime format for easier manipulation.
✅ Ensure Numeric Columns Are Correct – Convert any incorrectly stored numbers (e.g., stored as text) to the right data type.
# Convert DATE from an Excel-style serial day count to datetime
# (origin 1899-12-30 is the Excel epoch, so e.g. 43390 -> 2018-10-17)
transaction_data['DATE'] = pd.to_datetime(transaction_data['DATE'], unit='D', origin='1899-12-30')
# Verify data types
print("\nUpdated Data Types in Transaction Data:")
print(transaction_data.dtypes)
Updated Data Types in Transaction Data:
DATE              datetime64[ns]
STORE_NBR                  int64
LYLTY_CARD_NBR             int64
TXN_ID                     int64
PROD_NBR                   int64
PROD_NAME                 object
PROD_QTY                   int64
TOT_SALES                float64
dtype: object
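The DATE conversion works because the raw values are Excel-style serial day counts. For the second checklist item, numbers that were exported as text can be coerced with pd.to_numeric (a sketch on toy values; the columns in this dataset happen to be typed correctly already):

```python
import pandas as pd

s = pd.Series(['2', '3', 'n/a'])  # quantities exported as text

# errors='coerce' turns unparseable entries into NaN instead of raising
qty = pd.to_numeric(s, errors='coerce')
print(qty.dtype, qty.isna().sum())  # float64 1
```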
Step 7: Merging the Datasets 🔗 To get a complete view of the data, let's:
✅ Merge transaction_data & purchase_behaviour – Combine them on the LYLTY_CARD_NBR column, the loyalty card number that identifies each customer.
✅ Choose the Right Merge Type – Use an inner join (the pd.merge default) to keep only matching records, or an outer join to retain all rows from both tables.
# Merge datasets
merged_data = pd.merge(transaction_data, purchase_behaviour, on='LYLTY_CARD_NBR')
# Inspect the merged dataset
print("\nMerged Data:")
print(merged_data.head())
Merged Data:
DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR \
0 2018-10-17 1 1000 1 5
1 2019-05-14 1 1307 348 66
2 2019-05-20 1 1343 383 61
3 2018-08-17 2 2373 974 69
4 2018-08-18 2 2426 1038 108
PROD_NAME PROD_QTY TOT_SALES \
0 Natural Chip Compny SeaSalt175g 2 6.0
1 CCs Nacho Cheese 175g 3 6.3
2 Smiths Crinkle Cut Chips Chicken 170g 2 2.9
3 Smiths Chip Thinly S/Cream&Onion 175g 5 15.0
4 Kettle Tortilla ChpsHny&Jlpno Chili 150g 3 13.8
LIFESTAGE PREMIUM_CUSTOMER
0 YOUNG SINGLES/COUPLES Premium
1 MIDAGE SINGLES/COUPLES Budget
2 MIDAGE SINGLES/COUPLES Budget
3 MIDAGE SINGLES/COUPLES Budget
4 MIDAGE SINGLES/COUPLES Budget
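Because the inner join silently drops any loyalty card that appears in only one file, it can be worth re-running the merge with how='outer' and indicator=True to see how well the two tables line up (a sketch on toy tables):

```python
import pandas as pd

left = pd.DataFrame({'LYLTY_CARD_NBR': [1000, 1307],
                     'TOT_SALES': [6.0, 6.3]})
right = pd.DataFrame({'LYLTY_CARD_NBR': [1000, 9999],
                      'LIFESTAGE': ['YOUNG SINGLES/COUPLES', 'RETIREES']})

# indicator=True adds a _merge column saying where each row was found
check = pd.merge(left, right, on='LYLTY_CARD_NBR', how='outer', indicator=True)
print(check['_merge'].value_counts())
```

If every row reports 'both', the inner join loses nothing.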
Step 8: Check for Outliers Visualize outliers in the TOT_SALES column using an interactive boxplot with Plotly.
import plotly.express as px
# Interactive Boxplot for TOT_SALES
fig = px.box(merged_data, x='TOT_SALES', title='Boxplot of Total Sales')
fig.show()
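The boxplot flags points beyond the whiskers visually; the same 1.5×IQR rule can be applied numerically to pull the suspect rows out for inspection (a sketch on toy sales values; the 1.5 multiplier is the conventional default, not a law):

```python
import pandas as pd

sales = pd.Series([6.0, 6.3, 2.9, 15.0, 13.8, 650.0])  # toy TOT_SALES values

# Upper whisker of the boxplot: Q3 + 1.5 * IQR
q1, q3 = sales.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
outliers = sales[sales > upper]
print(outliers)  # only the 650.0 entry exceeds the upper whisker
```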
Step 9: Identify Top 3 Most Profitable Products Group the data by PROD_NBR and PROD_NAME, then calculate total sales for each product.
# Group by PROD_NBR and PROD_NAME to calculate total sales
product_revenue = merged_data.groupby(['PROD_NBR', 'PROD_NAME'])['TOT_SALES'].sum().reset_index()
# Get the top 3 products by total sales
top_products = product_revenue.nlargest(3, 'TOT_SALES')
# Display the top 3 products
print("\nTop 3 Profitable Products:")
print(top_products)
Top 3 Profitable Products:
PROD_NBR PROD_NAME TOT_SALES
3 4 Dorito Corn Chp Supreme 380g 40352.0
13 14 Smiths Crnkle Chip Orgnl Big Bag 380g 36367.6
15 16 Smiths Crinkle Chips Salt & Vinegar 330g 34804.2
Step 10: Visualize Top 3 Products with Plotly Create an interactive bar chart to visualize the top 3 products.
# Interactive Bar Chart for Top 3 Products
fig = px.bar(top_products, x='PROD_NAME', y='TOT_SALES', title='Top 3 Profitable Products',
labels={'PROD_NAME': 'Product Name', 'TOT_SALES': 'Total Sales ($)'},
color='TOT_SALES', color_continuous_scale='Blues')
fig.update_xaxes(tickangle=45)
fig.show()
Step 11: Identify Characteristics of Loyal Customers Calculate loyalty metrics (Total_Spent, Frequency, Recency) for each customer.
# Calculate loyalty metrics
customer_metrics = merged_data.groupby('LYLTY_CARD_NBR').agg(
Total_Spent=('TOT_SALES', 'sum'),
Frequency=('TXN_ID', 'count'),
Recency=('DATE', 'max')
)
# Calculate recency in days from the latest date
latest_date = merged_data['DATE'].max()
customer_metrics['Recency'] = (latest_date - customer_metrics['Recency']).dt.days
# Define a loyalty score (weighted sum of metrics)
customer_metrics['Loyalty_Score'] = (
customer_metrics['Total_Spent'] * 0.5 +
customer_metrics['Frequency'] * 0.3 +
customer_metrics['Recency'] * -0.2
)
# Identify top loyal customers
loyal_customers = customer_metrics.nlargest(100, 'Loyalty_Score')
# Merge loyal customers with purchase behaviour
# (LYLTY_CARD_NBR is the groupby index here, so reset it to a column first)
loyal_customer_profiles = loyal_customers.reset_index().merge(purchase_behaviour, on='LYLTY_CARD_NBR')
# Analyze demographics of loyal customers
demographics_summary = loyal_customer_profiles.groupby(['LIFESTAGE', 'PREMIUM_CUSTOMER']).size().reset_index(name='Count')
# Display demographics summary
print("\nCharacteristics of Loyal Customers:")
print(demographics_summary)
Characteristics of Loyal Customers:
LIFESTAGE PREMIUM_CUSTOMER Count
0 MIDAGE SINGLES/COUPLES Mainstream 2
1 MIDAGE SINGLES/COUPLES Premium 3
2 OLDER FAMILIES Budget 15
3 OLDER FAMILIES Mainstream 10
4 OLDER FAMILIES Premium 9
5 OLDER SINGLES/COUPLES Budget 6
6 OLDER SINGLES/COUPLES Mainstream 3
7 OLDER SINGLES/COUPLES Premium 6
8 RETIREES Budget 4
9 RETIREES Mainstream 1
10 RETIREES Premium 2
11 YOUNG FAMILIES Budget 14
12 YOUNG FAMILIES Mainstream 7
13 YOUNG FAMILIES Premium 9
14 YOUNG SINGLES/COUPLES Budget 2
15 YOUNG SINGLES/COUPLES Mainstream 6
16 YOUNG SINGLES/COUPLES Premium 1
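One caveat with the Loyalty_Score above: Total_Spent is in dollars while Frequency is a count and Recency is in days, so the raw weighted sum lets spending dominate. Min-max scaling each metric to [0, 1] before weighting puts the three on comparable footing (a sketch on toy values; the 0.5/0.3/0.2 weights are the same assumption used above):

```python
import pandas as pd

# Toy customer metrics in mixed units (dollars, visits, days)
metrics = pd.DataFrame({
    'Total_Spent': [120.0, 45.0, 300.0],
    'Frequency':   [10, 4, 25],
    'Recency':     [5, 60, 2],
})

# Min-max scale each column to [0, 1]
scaled = (metrics - metrics.min()) / (metrics.max() - metrics.min())

# Same weights as before, now applied to comparable quantities
score = (scaled['Total_Spent'] * 0.5
         + scaled['Frequency'] * 0.3
         - scaled['Recency'] * 0.2)
print(score.idxmax())  # customer 2 spends most, visits most, and is most recent
```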
Step 12: Visualize Loyal Customer Characteristics with Plotly Create an interactive grouped bar chart to visualize the characteristics of loyal customers.
# Interactive Bar Chart for Loyal Customer Characteristics
fig = px.bar(demographics_summary, x='LIFESTAGE', y='Count', color='PREMIUM_CUSTOMER',
title='Characteristics of Loyal Customers',
labels={'LIFESTAGE': 'Life Stage', 'Count': 'Number of Customers', 'PREMIUM_CUSTOMER': 'Customer Type'},
barmode='group')
fig.update_layout(xaxis={'categoryorder': 'total descending'})
fig.show()
Step 13: Generate Hypotheses Based on the findings, generate hypotheses about why loyal customers prefer certain products.
print("\nHypothesis:")
print("1. Loyal customers are likely to belong to specific LIFESTAGE groups (e.g., YOUNG FAMILIES or OLDER SINGLES/COUPLES).")
print("2. Premium customers may prefer high-quality or niche products, while Budget customers may focus on cost-effective items.")
print("3. Products preferred by loyal customers align with their lifestyle needs (e.g., family-oriented products for YOUNG FAMILIES).")